Abstract

Bottom-up saliency, an early stage of human visual attention, can be
considered as a binary classification problem between centre and surround
classes. The discriminant power of features for this classification is
measured as the mutual information between the distributions of image
features and the corresponding classes. As the estimated discrepancy depends
strongly on the scale level considered, multi-scale structure and
discriminant power are integrated by employing discrete wavelet features and
a Hidden Markov Tree (HMT). With wavelet coefficients and HMT parameters,
quad-tree-like label structures are constructed and used to compute the
maximum a posteriori probability (MAP) of hidden class variables at the
corresponding dyadic sub-squares. A saliency value for each square block at
each scale level is then computed with the discriminant power principle.
Finally, the final saliency map is integrated across multiple scales by an
information maximization rule. Both standard quantitative tools such as NSS,
LCC and AUC and qualitative assessments are used to evaluate the proposed
multi-scale discriminant saliency (MDIS) against the well-known
information-based approach AIM on its released image collection with
eye-tracking data. Simulation results are presented and analysed to verify
the validity of MDIS and to point out its limitations for further research.
1. Visual Attention - Computational Approach

Visual attention is a psychological phenomenon in which human visual
systems are optimized for capturing scenic information. The robustness and
efficiency of the biological devices involved, namely the eyes, their control
systems and the visual pathways in the brain, have amazed scientists and
engineers for centuries. From Neisser [1] to Marr [2], researchers have put
intensive effort into discovering attention principles and engineering
artificial systems with equivalent capability. For decades, this research
field has been dominated by a visual attention principle that proposes the
existence of a saliency map for attention guidance. The idea is further
promoted in Feature Integration Theory (FIT) [3], which elaborates
computational principles of saliency map generation with centre-surround
operators and basic image features such as intensity, orientation and colour.
Then, Itti et al. [4] implemented and released the first complete computer
implementation of FIT.

Feature Integration Theory is widely accepted as the principle behind
visual attention, partly because it relies on basic image features such as
colour, intensity and orientation. Moreover, the hypothesis is supported by
evidence from several psychological experiments. However, it only defines
theoretical aspects of saliency maps and visual attention, and does not
investigate how such principles would be implemented algorithmically. This
lack of implementation detail leaves the field open to many later saliency
algorithms [U El El E], etc. Saliency may be computed as a linear contrast
between features of central and surrounding environments across multiple
scales by centre-surround operators. Saliency is also modelled as a phase
difference in the Fourier transform domain [8], or its value depends on
statistical modelling of the local feature distribution [6]. Though many
approaches are mentioned in the long and rich literature on visual saliency,
only a few are built on a solid theory or linked to other well-established
computational theories. Among these approaches, Neil Bruce's work [9]
established a bridge between visual saliency and information theory. It is a
first step in bridging two distant fields; moreover, visual attention could
for the first time be viewed as an information system. Since then,
information-based visual saliency has been continuously investigated and
developed in several works [10, 11, 12, 13]. These works differ in their
computational approaches to retrieving information from features. The
process attracts much interest because of the challenges in estimating the
information of high-dimensional data such as 2-D image patches. It usually
runs into computational problems that cannot be solved efficiently owing to
the curse of dimensionality; moreover, central and surrounding contexts are
usually defined in an ad-hoc manner without much theoretical support. To
tackle these problems, Dashan Gao et al. simplified the information
extraction step into a binary classification problem with decision theory.
Two classes are identified as the centre and surround contexts; then their
discriminant power, that is, the mutual information between features and
classes, is estimated as the saliency value for each location. This
formulation is named Discriminant Saliency (DIS), and its underlying
principles are carefully elaborated by Gao et al. [H]. Its significant point
is that information is estimated from class distributions rather than from
the input features themselves. Therefore, the computational load is greatly
reduced, as only a simple class distribution needs to be estimated rather
than a complex feature distribution.

Spatial features have a large influence on saliency values; however,
scale-space features also play a decisive role in visual saliency
computation, since centre and surround environments are simply processing
windows of different sizes. In signal processing, scale space and spectral
space are two sides of the same coin; therefore, there is a strong relation
between scale, frequency and saliency in the visual attention problem.
Several researchers [15, 16, 17, 18] observed that fixated regions have high
spatial contrast, or showed that high-frequency edges allow stronger
discrimination between fixated and non-fixated points. In brief, they all
reach one conclusion: predictability increases with high-frequency features.
Although these studies emphasize a greater visual attraction to high
frequencies (edges, ridges and other image structures), other works focus on
medium frequencies. Bruce et al. [19] propose that fixation points tend to
prefer horizontal and vertical frequency content rather than random
positions, and that these oriented contents differ more noticeably at medium
frequencies. More interestingly, the choice of frequency range in biological
vision may depend on the visual context encountered [20]. For example,
luminance contrast explains fixation locations better in the natural image
category and slightly worse in a category of urban scenes, provided that all
images are low-pass filtered as a preprocessing step. The attention system
may therefore draw on different frequency ranges for optimal eye movements.
In other words, diversity in spectral space calls for the use of scale-space
theory in visual attention. It can be assumed that both high frequencies
(small scale) and medium frequencies (medium scale) are ecologically relevant
and balance information requirements against the available attentional
capacity in the early stage of visual attention, when observers are not
driven by any specific task.

Though its multi-scale nature has been emphasized as an implicit part of
human visual attention, it is often ignored in saliency algorithms. For
example, the DIS approach [H] considers only one fixed-size window; hence,
it may overlook significant attentive features in a scene. Therefore, the
DIS approach needs to be reformulated within a multiple-level framework to
form a multi-scale discriminant saliency (MDIS) approach. This is the main
motivation as well as the contribution of this paper, which is organized as
follows. Section 2 reviews the principles behind DIS [13] and focuses on its
important assumptions and limitations. Section 3 provides an alternative,
statistical interpretation of the centre-surround operation; moreover, it
describes the operation in the light of "large-variance" / "small-variance"
Gaussian distributions as well as the "implicit" / "explicit" manifold
concepts. After that, the MDIS approach is elaborated in section 4, covering
multiple dyadic windows for the binary classification problem in subsection
4.1, multi-scale statistical modelling of wavelet coefficients and learning
of parameters in subsections 4.2 and 4.3, and maximum likelihood (MLL) and
maximum a posteriori probability (MAP) computation for dyadic sub-squares in
subsections 4.4 and 4.5. Then, all MDIS steps are combined for final
saliency map generation in subsection 4.6. Quantitative and qualitative
analyses of the proposed method in its different modes are discussed in
section 5; moreover, comparisons between the proposed MDIS and the
well-known information-based saliency method AIM [9] on simulation data are
presented, with several interesting conclusions. Finally, the main
contributions of this paper as well as further research directions are
stated in the concluding section 6.



2. Visual Attention - Discriminant Saliency 

The saliency mechanism plays a key role in perceptual organization, and
several attempts have recently been made to generalize the principles of
visual saliency. From the decision-theoretic point of view, saliency is
regarded as the power to distinguish salient from non-salient classes;
moreover, it combines the classical centre-surround hypothesis with a
derived optimal saliency architecture. In other words, the saliency of each
image location is identified by the discriminant power of a feature set with
respect to the binary classification problem between centre and surround
classes. Based on decision theory, this discriminant saliency detector can
work with a variety of stimulus modalities, including intensity, colour,
orientation and motion. Moreover, various psychophysical properties for both
static and motion stimuli are shown to be accurately and quantitatively
satisfied by DIS saliency maps.

Perceptual systems evolve to produce optimal decisions about the state of
the surrounding environment, with minimum probability of error in a
decision-theoretic sense and with minimum computational effort. To achieve
these goals, artificial systems need a mathematical framework for
optimization. Mathematically, the problem can be defined as (1) a binary
classification of stimuli of interest against a null hypothesis (salient
against non-salient features) and (2) a measurement of the discriminant
power of extracted visual features as the saliency at each location in the
visual field. The discriminant power is estimated in the classification
process with respect to two classes of stimuli: stimuli of interest and null
stimuli comprising all features of no interest. Each location of the visual
field can be classified as containing stimuli of interest or not, optimally,
with the lowest expected probability of error. From a purely computational
standpoint, binary classification with discriminant features is widely
studied and well defined as a tractable problem in the literature. Moreover,
the discriminant saliency concept and decision theory appear in both
top-down and bottom-up problems with different specifications of the stimuli
of interest [5], [13].

The early stages of biological vision are dominated by the ubiquitous
"centre-surround" operator; bottom-up saliency is therefore commonly defined
as how reliably the stimuli at each central location of the visual field can
be distinguished from the other stimuli in its surround. In other words, the
"centre-surround" hypothesis is a natural binary classification problem that
can be solved by well-established decision theory. In this problem, the
classes are defined as follows.

• Interest hypothesis: observations within a central neighborhood $W_l^1$
of visual field location $l$.

• Null hypothesis: observations within a surrounding window $W_l^0$ of the
above central region.

At each location, the likelihood of either hypothesis depends on the visual
stimulus, described by a predefined set of features $X$. The saliency at
location $l$ is measured as the discriminating power of the features $X$ in
$W_l^1$ against the features $X$ in $W_l^0$. In other words, the
discriminant saliency value is proportional to the distance between the
feature distributions of the centre and surround classes.

Mathematically, feature responses within the windows are drawn from the
predefined feature set $X$. Since there are many possible combinations and
orderings in which such responses can be assembled, the observations of
features can be considered as a random process of dimension $d$,
$X(l) = (X_1(l), \ldots, X_d(l))$. This random process is drawn conditionally
on the state of a hidden variable $Y(l)$, which is either the centre or the
surround state. Feature vectors $x(j)$ with $j \in W_l^c$, $c \in \{0, 1\}$,
are drawn from class $c$ according to the conditional densities
$P_{X(l)|Y(l)}(x|c)$, where $Y(l) = 0$ for surround and $Y(l) = 1$ for
centre. The saliency at location $l$, $S(l)$, is equal to the discriminant
power of $X$ for classifying observed feature vectors, which can be
quantified by the mutual information between the features $X$ and the class
label $Y$:

$$S(l) = I_l(X;Y) = \sum_{c} \int P_{X,Y}(x,c) \log \frac{P_{X,Y}(x,c)}{P_X(x)\,P_Y(c)} \, dx.$$

Though binary classification and decision theory make discriminant saliency
computationally feasible, this holds only for low-dimensional data. Computer
vision and visual attention need to deal with high-dimensional input images,
especially when statistics and information theory are involved. As mentioned
previously, observations of feature responses, $X(l)$, are considered as a
random process in $d$-dimensional space. Mutual information estimation for
high-dimensional data encounters serious obstacles owing to the curse of
dimensionality as well as computational cost. As long as these problems
persist, saliency algorithms can be neither biologically plausible nor
computationally feasible. Therefore, discriminant saliency algorithms have
to be approximated by taking into account the statistical characteristics of
natural images as well as mathematical simplifications. Dashan Gao and Nuno
Vasconcelos have proposed a feasible algorithm for mutual information
estimation, mathematically formulated as follows.



$$\begin{aligned}
I_l(X;Y) &= H(Y) - H(Y|X) \\
&= E_X\left[ H(Y) + E_{Y|X}\left[ \log P_{Y|X}(c|x) \right] \right] \\
&= E_X\left[ H(Y) + \sum_{c=0}^{1} P_{Y|X}(c|x) \log P_{Y|X}(c|x) \right] \\
&\approx \frac{1}{|W_l|} \sum_{j \in W_l} \left[ H(Y) + \sum_{c=0}^{1} P_{Y|X}(c|x_j) \log P_{Y|X}(c|x_j) \right]
\end{aligned}$$



where $H(Y) = -\sum_{c=0}^{1} P_Y(c) \log P_Y(c)$ is the entropy of the
classes $Y$ and $H(Y|X) = -E_X[E_{Y|X}[\log P_{Y|X}(c|x)]]$ is the
conditional entropy of $Y$ given $X$. Given a location $l$, there are
corresponding centre $W_l^1$ and surround $W_l^0$ windows along with a set
of associated feature responses $x(j)$, $j \in W_l = W_l^0 \cup W_l^1$. The
mutual information can be estimated by replacing expectations with means
over all samples inside the joint window $W_l$. The conditional entropy
$H(Y|X)$ can be computed by analytically deriving the posterior $P(Y|X)$,
given that a Generalized Gaussian Distribution (GGD) is used for the
transformed features in the binary classification problem. We refer to Gao's
proposal for discriminant saliency computation as DIS; more details about
DIS can be found in their publications [H], [13], [21], [22].
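
The sample-mean estimate of $I_l(X;Y)$ can be sketched in a few lines of
Python. This is only a minimal illustration under stated assumptions, not
the DIS implementation: features are taken to be scalar, a plain Gaussian
fitted to each window stands in for the GGD, class priors are set from the
window sizes, and all function names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    # 1-D Gaussian density; an assumed stand-in for the GGD used by DIS.
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit_gaussian(samples):
    # Sample mean and variance of one window (variance floored for stability).
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, max(var, 1e-12)

def discriminant_saliency(centre, surround):
    # I(X;Y) ~ H(Y) + mean_j sum_c P(c|x_j) log P(c|x_j), measured in nats.
    samples = list(surround) + list(centre)
    n = len(samples)
    priors = [len(surround) / n, len(centre) / n]  # Y = 0 surround, Y = 1 centre
    params = [fit_gaussian(surround), fit_gaussian(centre)]
    h_y = -sum(p * math.log(p) for p in priors if p > 0.0)
    neg_cond_entropy = 0.0
    for x in samples:
        joint = [priors[c] * gaussian_pdf(x, *params[c]) for c in (0, 1)]
        evidence = sum(joint)
        if evidence == 0.0:
            continue  # no usable likelihood under either class
        for value in joint:
            posterior = value / evidence
            if posterior > 0.0:
                neg_cond_entropy += posterior * math.log(posterior)
    return h_y + neg_cond_entropy / n
```

With equal-sized windows the estimate is bounded by $\log 2$ nats: it
approaches that bound when the centre and surround samples are well
separated and falls to zero when the two windows hold the same samples.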

While DIS successfully defines discriminant saliency in a decision-theoretic
sense, its implementation has certain limits. Feature responses are randomly
sampled in a single fixed-size window; therefore, it is biased toward
objects whose distinctive features fit that window size. As previous
findings have confirmed the involvement of multi-scale factors in visual
attention, DIS needs to be extended from a fixed-scale process to a
multi-scale process. In theory, DIS can be carried out with windows of
different sizes, and this approach certainly produces image responses and
saliency values at multiple scales. However, such an approach is recommended
neither computationally nor biologically, since it causes high redundancy in
saliency values across scales. To solve the multi-scale problem
systematically, DIS should be integrated with multi-scale processing
techniques such as wavelet transforms.
3. Visual Attention - Centre and Surround Context

Discriminant Saliency (DIS) studies the differences between two pixel
distributions sampled from assigned windows, called the centre and surround
windows. It fits intuitively into the concept of centre-surround operators,
and the idea would work well if only a linear difference such as image
contrast needed to be computed. However, the use of mutual information, a
high-order comparison between two distributions, would require an enormous
number of sampled pixels, which the enclosed image patches certainly cannot
provide. Therefore, centre-surround operators have to be considered from an
information-theoretic and statistical point of view.

Modelling 2-D data such as images is a complicated task owing to the
infinite number of possible joint distributions. However, in the study of
visual attention, natural images are the only image class analysed. This
simplifies the task significantly, thanks to numerous works on the
statistics of natural images in the literature [23, 24, 25, 26, 27, 28].
Among several studies of the statistical aspects of natural images, we note
the importance of the explicit and implicit manifolds of Shi and Zhu [29].
They argue that image patches are the fundamental elements for object
modelling and recognition in natural images. Moreover, these patches play a
key role in forming the whole ensemble of natural image patches, which are
in turn classified into two types of subspaces: explicit and implicit
manifolds. The explicit manifolds mainly contain simple and regular image
primitives such as edges, bars, corners and junctions, while implicit
manifolds are dominated by complex and stochastic image patches such as
textures and clutter. Image scaling connects the two types of manifolds; in
other words, a specific point in an image can be classified as either edge
or texture according to the patch size. Furthermore, image scale not only
changes the nature of image patches between implicit and explicit manifolds
but also affects their entropy, since the two subspaces live separately in
the low- and high-entropy regimes. Studying this transition over scale shows
that manifold complexity peaks in the middle entropy regime [29]. In other
words, there exists an intermediate scale which collects the most
information from patches, if information is measured by entropy.

The transition between implicit and explicit manifolds over multiple scales
in natural images correlates strongly with the centre-surround operators of
visual attention theory. Centre and surround windows correspond to smaller
and bigger patches at the same location; in other words, they are patches at
two consecutive scales. When two or more pairs of centre and surround
patches are employed, multi-scale centre-surround operators are formed with
blocks at different scales. As image scaling plays a key role in classifying
patches into either implicit or explicit manifolds, centre and surround
patches have different likelihoods with respect to the manifolds. Therefore,
DIS can be formulated as a binary classification between two different types
of manifolds under statistical constraints, instead of between two distinct
patches under geometrical constraints. This certainly benefits the
estimation of mutual information, because DIS with statistical constraints
can be learnt from patches across the image, the global context, whereas the
original DIS with geometrical constraints limits the accuracy of information
estimation owing to the small number of pixels in local contexts.

Figure 1: (L): central context (explicit manifold). (C): original image.
(R): surrounding context (implicit manifold)

Deploying the DIS approach to decide the classes of image patches according
to their statistical characteristics requires mathematical models for
information estimation. Shi and Zhu propose manifold pursuit along with the
introduction of implicit and explicit manifolds. The pursuit uses an
iterative approach, the EM algorithm, to produce approximate statistical
models of image patches ($I$) over the global context $\Omega$. In their
approach, the training samples are image pixels, i.e. spatial features,
which again makes multi-scale extension difficult. Alternatively, we model
the statistical characteristics of wavelet features with Gaussian Mixture
Models (GMM), which are a good statistical approximation of the
distributions of wavelet coefficients of natural images and which use
large-variance and small-variance Gaussian distributions. These
large/small-variance and implicit/explicit descriptions are just two aspects
of one phenomenon, provided that Gaussian distributions are suitable for
modelling the features of natural images. For instance, a large-variance
Gaussian distribution implies large uncertainty, the high-entropy regime in
which implicit manifolds live; meanwhile, a small-variance Gaussian
distribution implies small uncertainty, the low-entropy regime in which
explicit manifolds exist. Therefore, the statistical constraints for DIS in
the global context can use large/small-variance Gaussian distributions
instead of the implicit/explicit manifold concepts, since they are more
convenient in mathematical models. In this paper, GMM concepts are used
throughout for modelling wavelet coefficients, which are spatial-scale
features.

4. Multiscale Discriminant Saliency 

Expansion from a fixed window size to multi-scale processing is commonly
desired in the development of computer vision algorithms. In the long
literature of computer vision research, there are many multi-scale
processing models applicable to a so-called Multiscale Discriminant Saliency
(MDIS). Any choice has to adapt binary classification to multiple scales.
Put differently, it requires efficient classification of an image datum into
a class at a particular scale with prior knowledge from other scales. With
respect to these requirements, a multi-scale image segmentation framework is
a good starting point for MDIS, since DIS can be considered as a simplified
binary image segmentation with only two classes. However, the binary
classification is only an intermediate step for measuring the discriminant
power of centre-surround features, and the accuracy of the segmentation
results does not really matter in this case.

Typical algorithms employ a rigid classification window in the vague hope
that all included pixels belong to the same class. DIS faces the same
problem of choosing suitable window sizes. The size is crucial for balancing
reliability against accuracy. A large window usually provides rich
statistical information and enhances the reliability of the algorithm.
However, it also risks including heterogeneous elements in the window and
eventually loses segmentation accuracy. Therefore, appropriate window sizes
are equally vital for avoiding local maxima in discriminant power. If window
sizes are too large or too small, MDIS risks losing useful discriminative
features or being too susceptible to noise. In brief, the sampling rate, and
consequently the number of data points in a window, directly affects the
performance of binary classification / segmentation and eventually the
computation of discriminant power.

4.1. Multiple Classification Windows

Multi-scale segmentation employs classification windows at multiple scales
on an image and then combines the responses across scales. MDIS can adopt a
similar approach to classify image features into either centre or surround
classes. Though window sizes could be chosen arbitrarily, dyadic squares (or
blocks) are implemented in MDIS because of their compactness and efficiency.
Let us assume an initial square image $x$ of $2^J \times 2^J$, i.e.
$n := 2^{2J}$, pixels; the dyadic square structure can be generated by
recursively dividing $x$ into four equal square sub-images, as shown on the
left side of Figure 2. The result is the popular quad-tree structure,
commonly employed in computer vision and image processing problems. In this
tree structure, each node is related to the parent node directly above it,
while it is itself the parent of the four nodes directly below it (Figure
2). Each quad-tree node corresponds to a dyadic square and is denoted as a
tree node at scale $j$ by $d_i^j$, where $i$ is the spatial index of the
dyadic square node. Given a random field image $X$, the dyadic squares are
also random fields, written mathematically as $D_i^j$. In the following
sections, we sometimes use $D_i$ (dropping the scale index $j$) for a
general random dyadic square regardless of scale. Using these dyadic squares
as classification windows, we can classify each node $d_i^j$ as either
centre or surround by estimating its maximum a posteriori probability (MAP).
Then, the mutual information between features and corresponding labels can
be computed by averaging MAP across all classes. This mutual information
corresponds to the core concept of the discriminant power of central
features against surrounding ones at each location [5]. However, the use of
quad-tree structures makes the estimation of mutual information possible at
many scales. In order to compute the discriminant power, multiple PDFs need
to be learned through wavelet-based statistical models.
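
The recursive subdivision into dyadic squares can be illustrated with a
short Python sketch. The indexing convention is an assumption made for the
example (scale 0 is the whole image and scale $J$ holds single pixels), and
the function names are hypothetical.

```python
def dyadic_squares(J):
    # Map each scale j to its dyadic squares (row, col, size) in a
    # 2**J x 2**J image. Assumed convention: j = 0 is the whole image.
    side = 2 ** J
    squares = {}
    for j in range(J + 1):
        size = side // (2 ** j)
        squares[j] = [(row * size, col * size, size)
                      for row in range(2 ** j)
                      for col in range(2 ** j)]
    return squares

def children(square):
    # The four equal sub-squares of a dyadic square (its quad-tree children).
    row, col, size = square
    half = size // 2
    return [(row, col, half), (row, col + half, half),
            (row + half, col, half), (row + half, col + half, half)]
```

Each square at scale $j$ has exactly four children at scale $j + 1$, so the
number of classification windows grows by a factor of four per scale while
every window stays aligned with the wavelet quad-tree.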

4.2. Multi-scale Statistical Model

The wavelet-domain Hidden Markov Model (HMM) captures the main statistical
features of wavelet transforms of real-world images. On the one hand, the
raw HMT model requires parameters to be learnt for each data point; however,
such training makes it unwieldy for images with large amounts of data. On
the other hand, arbitrarily specified parameters would avoid a
time-consuming learning stage, which is even impossible in some cases, but
would risk over-fitting the model. These issues may render HMT inappropriate
for applications that require rapid processing but lack sufficient prior
information. Therefore, several variants of the HMT algorithm are studied in
this section in order to make HMT more practical, as well as to identify
their advantages and limitations.

Figure 2: Quad-Tree and Dyadic Wavelet Structures

The marginal distribution of the wavelet coefficients $w_i$ follows directly
from the sparseness of the wavelet transform when modelling real-world
images: a minority of large coefficients and a majority of small
coefficients. This distribution is efficiently captured by a Gaussian
Mixture Model (GMM) with wavelet coefficients $w_i$ and hidden state
variables, or class labels, $S_i \in \{S, L\}$. The state value $S_i$
decides which mixture component generates the coefficient $w_i$.



$$g(x; \mu, \sigma^2) := \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$

Let $g$ denote the Gaussian PDF. The small-variance state $S_i = S$ has the
zero-mean density

$$f(w_i | S_i = S) = g(w_i; 0, \sigma^2_{S;i}),$$

while the state $S_i = L$ has a zero-mean, large-variance Gaussian density

$$f(w_i | S_i = L) = g(w_i; 0, \sigma^2_{L;i}),$$

where $\sigma^2_{L;i} > \sigma^2_{S;i}$. With these mixture components, we
can write the marginal PDF $f(w_i)$ as a convex combination of the
conditional densities,

$$f(w_i) = p_i^S \, g(w_i; 0, \sigma^2_{S;i}) + p_i^L \, g(w_i; 0, \sigma^2_{L;i}),$$

where $p_i^S + p_i^L = 1$, since $p_i = [p_i^S, p_i^L]$ is the probability
mass function of the states. Statistically, $p_i^{S_i}$ expresses how likely
the wavelet coefficient $w_i$ is to be small or large.
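
The two-state mixture can be made concrete with a small Python sketch. It
evaluates the marginal density and, via Bayes' rule, the posterior
probability that a coefficient was generated by the large-variance state.
The function names and the numeric values in the usage note are hypothetical
and chosen only for illustration.

```python
import math

def zero_mean_gauss(w, var):
    # Zero-mean Gaussian density g(w; 0, var).
    return math.exp(-w * w / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_pdf(w, p_small, var_small, var_large):
    # Marginal f(w) = p_S * g(w; 0, var_S) + p_L * g(w; 0, var_L), p_L = 1 - p_S.
    return (p_small * zero_mean_gauss(w, var_small)
            + (1.0 - p_small) * zero_mean_gauss(w, var_large))

def posterior_large(w, p_small, var_small, var_large):
    # P(S_i = L | w_i = w) obtained from the mixture by Bayes' rule.
    large = (1.0 - p_small) * zero_mean_gauss(w, var_large)
    return large / mixture_pdf(w, p_small, var_small, var_large)
```

For example, with `p_small = 0.5`, `var_small = 1.0` and `var_large = 100.0`,
a coefficient near zero is attributed mostly to the small-variance state,
whereas a coefficient of magnitude 10 is attributed almost entirely to the
large-variance state.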

The HMT captures inter-scale dependencies with a probabilistic tree
connecting the hidden state variable of a wavelet coefficient to those of
its four children. The dependency graph therefore has the same quad-tree
topology as the wavelet decomposition, and it includes state-to-state links
between parent and child coefficients, mathematically modelled by
persistency probabilities and novelty probabilities.

$$A_i = \begin{bmatrix} p_i^{S \to S} & p_i^{S \to L} \\ p_i^{L \to S} & p_i^{L \to L} \end{bmatrix},$$

where $p_i^{S \to S} + p_i^{S \to L} = 1$ and
$p_i^{L \to S} + p_i^{L \to L} = 1$. The persistency probabilities
$p_i^{S \to S}$ and $p_i^{L \to L}$ lie on the main diagonal of the above
matrix, since they represent how likely a state is to be preserved across a
parent-child link. The off-diagonal entries are the novelty probabilities,
which give the likelihood that parent and child nodes have different states.
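
As an illustration of how such a transition matrix acts on state
probabilities, the following Python sketch propagates the state distribution
from a parent to its children and down successive scales. The row/column
order (S first, then L), the function names and the numbers in the usage
note are assumptions made for the example.

```python
def is_stochastic(A, tol=1e-9):
    # Each row of a transition matrix must sum to one.
    return all(abs(sum(row) - 1.0) < tol for row in A)

def child_state_probs(parent_probs, A):
    # parent_probs = (P(S), P(L)); A[a][b] = P(child = b | parent = a),
    # with states ordered S, L (an assumed convention).
    return tuple(sum(parent_probs[a] * A[a][b] for a in range(2))
                 for b in range(2))

def propagate(root_probs, transitions):
    # State probabilities at every scale, starting from the coarsest one.
    probs = [tuple(root_probs)]
    for A in transitions:
        probs.append(child_state_probs(probs[-1], A))
    return probs
```

With persistency probabilities 0.9 and 0.8, i.e.
`A = [[0.9, 0.1], [0.2, 0.8]]`, a parent distribution of (0.5, 0.5) becomes
(0.55, 0.45) at the child scale.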

In summary, a trained HMT model is specified in terms of (i) the Gaussian
mixture variances $\sigma^2_{S;i}$ and $\sigma^2_{L;i}$, (ii) the state
transition matrices $A_i$ and (iii) the probability mass function $p_1$ at
the coarsest level. Grouping these parameters into a model $\mathcal{M}_c$,
we can define the trained HMT model as follows:

$$\mathcal{M}_c = \{ p_1, A_2, \ldots, A_J; \sigma^2_{S_i;b,j,k} \}, \tag{2}$$

where $j = 1, \ldots, J$, $S_i \in \{L, S\}$, $k \in \mathbb{Z}^2$ and
$b \in B$. Here $k$ is the index of a wavelet coefficient at scale $j$,
while $b$ denotes the wavelet sub-band,

$$B = \{HL, LH, HH\}.$$

The parametric model $\mathcal{M}_c$ can be used to learn the joint PDF
$f(w | \mathcal{M}_c)$ of the wavelet coefficients in each of the three
sub-bands [30]. Generally, each wavelet coefficient node has its own model
with different parameters. However, this over-complicates the model with too
many parameters; for example, $n$ wavelet coefficients would have to fit a
model with $4n$ parameters, which is an impossible task. Therefore, to
reduce the complexity, the HMT is assumed to use the same parameters for
every node at the same scale of the wavelet transform, regardless of the
spatial index $k$ and the orientation index $b$.

$$\sigma^2_{S;b,j,k} = \sigma^2_{S;j}, \qquad
\sigma^2_{L;b,j,k} = \sigma^2_{L;j}, \qquad
A_{b,j,k} = A_j, \qquad \forall k \in \mathbb{Z}^2, \ \forall b \in B$$

This assumption is called tying within scale [30]; it prevents an
over-complex and infeasible HMT model at the cost of a less general model,
which is mathematically formulated as follows.

$$\mathcal{M}_c = \{ p_1, A_2, \ldots, A_J; \sigma^2_{S_i;j}, \ (j = 1, \ldots, J, \ S_i = L, S) \} \tag{3}$$

We refer to the model $\mathcal{M}_c$ in equation (3) as the trained HMT
(THMT). The tying trick significantly simplifies the learning process for
HMT parameters; in fact, the model only needs to be trained on the input
image. However, it is possible to simplify the approach further if the image
class is known. Using many images with similar contexts, we can train the
HMT offline to obtain meta-parameters, which are later fixed in the HMT
model. This yields a general HMT model for that class of images, with each
member of the class treated as statistically equivalent [31]. Fortunately,
Romberg et al. have studied such models for the class of natural images and
published universal parameters obtained by (jointly) fitting lines to the
HMT parameters of four images, as follows.



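$$\begin{array}{ll}
\alpha_S = 3.1, & C_{\sigma_S} = 2^{11}, \\
\alpha_L = 2.25, & C_{\sigma_L} = 2^{11}, \\
\lambda_S = 1, & C_{SS} = 2^{2.3}, \\
\lambda_L = 0.4, & C_{LL} = 2^{0.5}, \\
p_1^L = 0.5 &
\end{array}$$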



The variance and persistence decays are measured by fitting a line to the
log of the variance versus scale for each state [31]; the fit starts at scale
j = 4 for the variances and at j = 5 for the state transition probabilities.
This choice of scale ensures enough data for an accurate estimate of the
decays; moreover, Romberg et al. [31] find that the decays are very similar
for many natural images. Therefore, it is reasonable to fix an HMT model
with meta-parameters learnt from natural images. This approach is called
Universal HMT (UHMT) in this paper. Although the UHMT model clearly loses
accuracy by treating all images as statistically equivalent, the assumption
eliminates the need for training, saves a tremendous computational workload,
and makes real-time HMT possible. Experiments with the UHMT model are
reported in Section 5, where both the accuracy and the efficiency of UHMT
are evaluated.
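As a rough illustration of how fixed meta-parameters can stand in for training, the sketch below generates per-scale state variances from an exponential decay law. The functional form (a straight line in log2 of the variance versus scale), the function name `decayed_variances`, and the value `log2_c=11` (an assumed reading of the listed constant) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def decayed_variances(num_scales, alpha, log2_c, first_scale=1):
    """Return variances 2**log2_c * 2**(-alpha * j) for consecutive scales j.

    Assumed decay law: a straight line in log2(variance) versus scale j.
    """
    scales = np.arange(first_scale, first_scale + num_scales)
    return 2.0 ** (log2_c - alpha * scales)

# Hypothetical usage with the decay rates listed for the two states.
small = decayed_variances(5, alpha=3.1, log2_c=11)
large = decayed_variances(5, alpha=2.25, log2_c=11)
```

With these inputs the "large" state keeps a bigger variance than the "small" state at every scale, which is the property the two-state model relies on.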

While the UHMT approach eliminates the training stage of THMT to reduce the
computational requirement for real-time applications, it is sometimes
necessary to have a more accurate HMT for modelling images. Although the
tied THMT assumes the same parameters for wavelet sub-bands, or
coefficients, at different orientations, its underlying learning treats each
band independently. However, experiments by Simoncelli and Portilla [32]
demonstrate the importance of cross-correlation between wavelet sub-bands of
different orientations at the same scale for modelling texture images.
Textural features are common in natural images as well; therefore, capturing
this dependency should improve the accuracy of HMT models. Since THMT treats
the coefficients of each sub-band independently, it ignores the
cross-orientation correlation. To enhance the capacity of THMT, Do and
Vetterli [27] propose grouping the coefficients at the same location and
scale into a vector and carrying out HMT modelling in a multidimensional
manner. In this paper, we call this approach Vectorized HMT (VHMT). Let us
denote the vector obtained by grouping the wavelet coefficients at location
k and scale j in the three orientations, vertical (V), horizontal (H), and
diagonal (D), as follows.

$$\mathbf{w}_{j,k} = \left( w_{j,k}^{V},\ w_{j,k}^{H},\ w_{j,k}^{D} \right)^{T}$$

Since multidimensional groups of wavelet coefficients $\mathbf{w}_{j,k}$ are
concerned, their distribution needs to be formulated as a zero-mean
multivariate Gaussian density with covariance matrix C as follows.

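$$g(\mathbf{w}; C) = \frac{1}{\sqrt{(2\pi)^n \lvert \det(C) \rvert}} \exp\left( -\frac{1}{2} \mathbf{w}^T C^{-1} \mathbf{w} \right)$$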

where n is the number of dimensions, i.e. the number of orientations, n = 3
in this case. Apart from the multivariate probability density, VHMT follows
the same approach as the other HMTs. Its marginal distribution is formulated
as a Gaussian mixture model, i.e.

$$f_j(\mathbf{w}) = p_j^{S}\, g(\mathbf{w}; C_j^{S}) + p_j^{L}\, g(\mathbf{w}; C_j^{L})$$
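The density and the two-state mixture above can be evaluated numerically as in the following minimal sketch. The function names and the convention that the "large" state probability is one minus the "small" one are assumptions made for illustration.

```python
import numpy as np

def gaussian_density(w, cov):
    """Zero-mean multivariate Gaussian density g(w; C)."""
    w = np.asarray(w, dtype=float)
    cov = np.asarray(cov, dtype=float)
    n = w.shape[0]
    norm = np.sqrt((2.0 * np.pi) ** n * abs(np.linalg.det(cov)))
    quad = w @ np.linalg.solve(cov, w)   # w^T C^{-1} w without forming the inverse
    return float(np.exp(-0.5 * quad) / norm)

def mixture_density(w, p_small, cov_small, cov_large):
    """Two-state mixture: p_S g(w; C_S) + (1 - p_S) g(w; C_L)."""
    return (p_small * gaussian_density(w, cov_small)
            + (1.0 - p_small) * gaussian_density(w, cov_large))
```

For a three-orientation coefficient vector, `cov_small` and `cov_large` would be 3x3 matrices whose off-diagonal entries carry the cross-orientation covariance.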

Moreover, its statistical inter-scale dependency is modelled through the parent- 
child relationship with a quad-tree structure linking a parent with its four 
children at the next level in the same location. A small difference here is 
that only one tree is utilized instead of three trees since wavelet coefficients 
have been grouped and modelled simultaneously. Hence, an image can be 
modelled by VHMT with a set of parameters. 

$$\Theta = \{ p_1, A_2, \ldots, A_J;\ C_j^{S_i},\ (j = 1, \ldots, J,\ S_i = L, S) \}$$

Since only one quad-tree is built for modelling the vector of wavelet
coefficients, the hidden states are "tied up" as well. That is, the same
hidden state is assigned regardless of orientation; in other words, VHMT is
orientation-invariant. VHMT captures dependencies across orientations via
the covariance matrix C of the multivariate Gaussian density. The diagonal
elements of the matrix are the variances of each orientation, while the
off-diagonal elements represent the covariances of wavelet coefficients
across sub-bands. This justifies the VHMT model for textural features, since
their wavelet coefficients are very likely to be significant at all
orientations in edge regions and small in every direction in smooth regions.

All three derivatives, (U/T/V)HMT, take different approaches to a common
goal: capturing the statistical characteristics of natural images. Each
approach emphasizes a different aspect: THMT models the general parent-child
dependency, UHMT eliminates the need for training by using meta-parameters
for a certain image category, and VHMT emphasizes directionally invariant
features. Implementing these derivatives has both positive and negative
effects, which are elaborated further in Section 5.

4.3. Multi-scale Statistical Learning

The complete joint pixel PDF is typically overcomplicated and difficult to
model because of its high-dimensional nature. The lack of a simple
distribution model in practice motivates statistical modelling in a
transform domain, which is often less complex and easier to estimate. The
joint pixel PDF can then be approximated by the marginal PDFs of the
transformed coefficients. Since the wavelet transform characterizes the
singularities of natural images well, it provides a suitable transform
domain for modelling the statistical properties of singularity-rich images.

Natural images are full of edges, ridges and other highly structured
features as well as textures; wavelet transforms, acting as multi-scale edge
detectors, represent such singularity-rich content well at multiple scales
and in three different directions. Note that only the ordinary discrete
wavelet transform (DWT) is considered in this study for the sake of
simplicity, though the concepts can be adapted to other wavelet transforms
as well. Henceforth, whenever a wavelet transform is mentioned, it refers to
the DWT unless stated otherwise. As multi-scale edge detectors, wavelet
transforms respond with large coefficients over a singularity, while the DWT
of a smooth region yields small coefficients. A simple hard threshold on the
wavelet coefficients therefore gives a binary differentiation between
singular and non-singular features. Moreover, the statistical model of an
image can be significantly simplified under this "re-structured" multi-scale
singularity representation.
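The hard-threshold idea can be tried on a single Haar level as sketched below. This is a hand-written one-level transform for illustration only; the function names, the placeholder detail-band names and the threshold value are assumptions, and a full DWT would repeat the step on the approximation band.

```python
import numpy as np

def haar_level(image):
    """One level of an orthonormal 2-D Haar DWT on an even-sized array.

    Returns the approximation band and three detail bands. The detail band
    names (d1, d2, d3) are placeholders; naming conventions for LH/HL differ.
    """
    x = np.asarray(image, dtype=float)
    a = x[0::2, 0::2]
    b = x[0::2, 1::2]
    c = x[1::2, 0::2]
    d = x[1::2, 1::2]
    approx = (a + b + c + d) / 2.0
    d1 = (a - b + c - d) / 2.0   # differences between left and right columns
    d2 = (a + b - c - d) / 2.0   # differences between top and bottom rows
    d3 = (a - b - c + d) / 2.0   # diagonal differences
    return approx, (d1, d2, d3)

def singularity_mask(details, threshold):
    """Hard threshold: True where any detail coefficient is large in magnitude."""
    return np.any([np.abs(band) > threshold for band in details], axis=0)
```

A flat region produces an all-False mask, while a 2x2 block that straddles an edge is flagged.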

The quad-tree structure of dyadic squares in the pixel domain is mirrored in
the wavelet decomposition (the right-hand side of Figure 2), since four
wavelet coefficients at a given scale nest inside one coefficient at the
next coarser level. For example, the Haar wavelet coefficients at each
quad-tree node are generated by the Haar wavelet transform of the
corresponding dyadic image square, as illustrated by the projection from the
combination of $LL_2$, $HL_2$, $LH_2$ and $HH_2$ to $LL_1$. The combination
of singularity detection and the multi-scale quad-tree structure implies
that the singularity property at each spatial location persists through
scales along the branches of the quad-tree in the transform domain.
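The nesting of four finer nodes inside one coarser node reduces to simple index arithmetic, as the following sketch shows. The helper names and the convention that scale 0 covers the whole image are assumptions for illustration.

```python
def children(row, col):
    """Four children at the next finer scale of node (row, col)."""
    return [(2 * row + dr, 2 * col + dc) for dr in (0, 1) for dc in (0, 1)]

def parent(row, col):
    """Parent at the next coarser scale of node (row, col)."""
    return (row // 2, col // 2)

def dyadic_square(row, col, scale, image_size):
    """Pixel bounds (top, left, side) of the dyadic square of a node.

    Assumes scale 0 is the whole image and each scale halves the side length.
    """
    side = image_size >> scale
    return (row * side, col * side, side)
```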

The characterization of singularities along scales makes the wavelet domain
well suited for modelling natural images. Statistical models of wavelet
coefficients have a comprehensive literature; here we concentrate only on
the Hidden Markov Tree of Crouse, Nowak and Baraniuk [30]. Considering both
the marginal and the joint statistics of wavelet coefficients, the HMT model
assigns a hidden state variable, either "large" or "small", to each
quad-tree node at each scale. The marginal density of the wavelet
coefficients is then modelled as a two-state Gaussian mixture, in which
"large" and "small" refer to the variance of the corresponding Gaussian
distribution. Such mixtures closely match the marginal statistics of natural
images [33], [31], [35]. With the HMT, the persistence of large or small
coefficients across scales is captured by a Markov-1 chain. It models the
dependencies between hidden states across scales in a tree structure
parallel to that of the wavelet coefficients and the dyadic squares. With
the parameters of the Gaussian mixture model (GMM) and the Markov state
transitions collected in a vector $\mathcal{M}$, the HMT model approximates
the overall joint PDF of the wavelet coefficients $\mathbf{w}$ by a
high-dimensional but highly structured Gaussian mixture model
$f(\mathbf{w} \mid \mathcal{M})$.

The highly structured nature of wavelet coefficients allows an efficient
implementation of HMT-based processing. The parameters $\mathcal{M}$ of an
HMT model can be learned with the iterative expectation-maximization (EM)
algorithm at a cost of $O(n)$ per iteration [30] in (T/V)HMT, or predefined
for a particular image category [31] in UHMT. Once the parameters
$\mathcal{M}$ are available, we need to compute the statistical
characteristics of the wavelet coefficients given the DWT coefficients
$\mathbf{w}$ of an image x and the set of HMT parameters $\mathcal{M}$. This
is a realization of the HMT model, in which computing the likelihood
$f(\mathbf{w} \mid \mathcal{M})$ requires only a simple $O(n)$ up-sweep
through the HMT tree from the leaves to the root [30].
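To give a feel for the EM step, the sketch below fits only the marginal two-state, zero-mean Gaussian mixture to a set of coefficients. It is a deliberate simplification: a full HMT training would also re-estimate the parent-child state transitions, which are ignored here. The function name, the initial values and the fixed iteration count are assumptions.

```python
import numpy as np

def em_two_state(w, iterations=50):
    """EM for a zero-mean two-state Gaussian mixture on 1-D data.

    Simplification: fits only the marginal mixture and ignores the
    parent-child state transitions that a full HMT would also learn.
    """
    w = np.asarray(w, dtype=float).ravel()
    var = np.array([0.5, 2.0]) * np.mean(w ** 2)   # initial small/large variances
    prob = np.array([0.5, 0.5])                     # initial state probabilities
    for _ in range(iterations):
        # E-step: responsibility of each state for each coefficient.
        dens = (np.exp(-0.5 * w[None, :] ** 2 / var[:, None])
                / np.sqrt(2.0 * np.pi * var[:, None]))
        joint = prob[:, None] * dens
        resp = joint / joint.sum(axis=0, keepdims=True)
        # M-step: re-estimate state probabilities and variances.
        prob = resp.mean(axis=1)
        var = (resp * w[None, :] ** 2).sum(axis=1) / resp.sum(axis=1)
    return prob, var
```

Each iteration touches every coefficient once, which is in line with the linear per-iteration cost quoted in the text.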

Conveniently, wavelet-based HMT models share the structure of the wavelet
decomposition and the dyadic squares. Therefore, the statistical behaviour
of each square block $d_i$ can be approximated by the HMT branch rooted at
node i. As mentioned earlier, the likelihood of the sub-tree $\mathcal{T}_i$
is computed simply by sweeping up from the corresponding leaf nodes at scale
J to the root node of the sub-tree at scale j. If the up-sweep is carried
out from the leaf nodes at scale J to the root node at scale 0, we obtain
the likelihood of the whole image [30]. The estimate at an intermediate
scale j gives the statistical behaviour of the corresponding dyadic
sub-square, $f(d_i \mid \mathcal{M})$, under the HMT model.

The above model opens the prospect of a simple multi-scale image
classification algorithm. Suppose the centre and surround classes are
denoted by $c \in \{1, 0\}$, and that an HMT has been specified or trained
for each class with parameters $\mathcal{M}^c$. The likelihood calculation
above is then deployed on each node of the HMT quad-tree given the wavelet
transform $\mathbf{w}$ of an image x. For each node of the tree, the HMT
yields the likelihood $f(d_i \mid \mathcal{M}^c)$, $c \in \{1, 0\}$, of the
dyadic block $d_i$. With the multi-scale likelihoods at hand, we can choose
the most suitable class c for a dyadic sub-square $d_i$ as follows.

$$\hat{c}_i^{ML} := \arg\max_{c \in \{1, 0\}} f(d_i \mid \mathcal{M}^c)$$

The most likely label for each dyadic sub-square $d_i$ is found by a simple
comparison between the likelihoods of the available classes. Moreover, the
approach can handle large inputs such as images because of its linear
computational cost: $O(n)$ operations for an entire n-pixel image.
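The comparison itself is a one-liner once the per-class likelihoods are available. The sketch below assumes they are supplied as arrays of log-likelihoods, one entry per dyadic block; the function name and the tie-breaking choice are illustrative.

```python
import numpy as np

def ml_labels(loglik_class0, loglik_class1):
    """Label each dyadic block with the class of larger log-likelihood.

    Inputs are arrays of log f(d_i | M^c) for c = 0 and c = 1 at one scale;
    ties are resolved in favour of class 0 (an arbitrary choice).
    """
    l0 = np.asarray(loglik_class0, dtype=float)
    l1 = np.asarray(loglik_class1, dtype=float)
    return (l1 > l0).astype(int)
```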

4.4. Multiscale Likelihood Computation

For a given set of HMT model parameters $\mathcal{M}$, it is
straightforward to compute the likelihood $f(\mathbf{w} \mid \mathcal{M})$
by sweeping up from the leaf nodes to the current node in a single sub-band
branch [30]. Moreover, the likelihoods of all dyadic squares in the image
can be obtained simultaneously in a single pass along the tree, since the
Hidden Markov Tree (HMT) and the Discrete Wavelet Transform (DWT) use the
same dyadic squares in the quad-tree structure.

To obtain the likelihood of a sub-tree $\mathcal{T}_i$ of wavelet
coefficients rooted at $w_i$, we deploy wavelet HMT trees and learn the
parameters $\Theta$ for multiple levels [30]. The conditional likelihood
$\beta_i(m) := f(\mathcal{T}_i \mid S_i = m, \Theta)$ can be retrieved by
sweeping up to node i (see [30]); the likelihood of the coefficients in
$\mathcal{T}_i$ can then be computed as follows.

$$f(\mathcal{T}_i \mid \Theta) = \sum_{m = S, L} \beta_i(m)\, p(S_i = m \mid \Theta) \qquad (4)$$

where the state probabilities $p(S_i = m \mid \Theta)$ can be predefined or
obtained during training [31].

Owing to the correspondence between wavelet coefficient sub-trees and dyadic
squares, the pixels of each square block $d_i$ are represented by three
sub-band sub-trees
$\{\mathcal{T}_i^{LH}, \mathcal{T}_i^{HL}, \mathcal{T}_i^{HH}\}$, whose
likelihoods are calculated independently by equation (4) in their
corresponding trees. Independent estimation of the three bands is an
appropriate computational step because of the assumption that correlation
between feature channels does not affect discriminant power, Gao [14].
Furthermore, the DWT is known as a decorrelating tool, and the decorrelated
sub-bands are treated as independent of each other. Hence, the likelihood of
a dyadic square is formulated as the product of the three independent
likelihoods of the wavelet sub-bands at each scale.

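$$f(d_i \mid \mathcal{M}) = f(\mathcal{T}_i^{LH} \mid \Theta^{LH})\, f(\mathcal{T}_i^{HL} \mid \Theta^{HL})\, f(\mathcal{T}_i^{HH} \mid \Theta^{HH}) \qquad (5)$$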
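The up-sweep behind equation (4) can be sketched for a single scalar sub-band as follows; per equation (5), three such results would then be multiplied. The array layout (`coeffs[j]`, `variances[j]`, `trans[j]`), the state ordering (small, large) and the function names are assumptions for illustration. A practical version would use scaled or log-domain quantities to avoid numerical underflow on deep trees.

```python
import numpy as np

def gauss(w, var):
    """Zero-mean Gaussian density with variance var."""
    return np.exp(-0.5 * w ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def upsweep(coeffs, variances, trans):
    """Conditional sub-tree likelihoods beta[j][m] for every node.

    coeffs[j]    : (2**j, 2**j) coefficients at level j (0 = coarsest).
    variances[j] : two variances (small, large) at level j.
    trans[j]     : 2x2 matrix, trans[j][m, n] = p(child state n | parent state m),
                   for parents at level j - 1 (trans[0] is unused).
    """
    last = len(coeffs) - 1
    beta = [None] * (last + 1)
    for j in range(last, -1, -1):
        w = np.asarray(coeffs[j], dtype=float)
        b = np.stack([gauss(w, variances[j][m]) for m in (0, 1)])
        if j < last:
            # Message from each child to its parent, for each parent state m.
            msg = np.einsum("mn,nrc->mrc", np.asarray(trans[j + 1]), beta[j + 1])
            size = w.shape[0]
            # Multiply the four children that nest inside each parent.
            b = b * msg.reshape(2, size, 2, size, 2).prod(axis=(2, 4))
        beta[j] = b
    return beta

def subtree_likelihood(beta_j, state_prob):
    """Equation (4): sum over states of beta_i(m) p(S_i = m)."""
    return np.einsum("m,mrc->rc", np.asarray(state_prob, dtype=float), beta_j)
```

A single backward pass fills `beta` for every node, so the likelihoods of all dyadic squares at all scales come out of one sweep.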

Note that the assumption of independent sub-bands is only necessary in the
(U/T)HMT models, whereas VHMT groups all coefficients into a single vector.
It therefore needs only one quad-tree representation of the multivariate
coefficients, $\mathcal{T}_i$, and the likelihood of the dyadic square under
each tree node is formulated as follows.

$$f(d_i \mid \mathcal{M}) = f(\mathcal{T}_i \mid \Theta) \qquad (6)$$

Using equation (5), the likelihood can be computed for each dyadic square
down to the 2x2 block scale. Note that the LL sub-band of the wavelet
transform is not used in our computation, since it is a low-pass
approximation of the original image and is therefore sensitive to pixel
brightness and to the lighting conditions of the scene. As natural images
are the main test category, shading and extreme brightness conditions cannot
be avoided; hence the final LL sub-band is discarded. The simple likelihood
formulation above is usually employed in block-by-block, or "raw",
classification, since it does not exploit any relationship between scales.
Classification decisions between the classes (centre and surround) are not
inherited across dyadic scales, because the likelihood estimation at one
scale is isolated from the processes at other levels. Therefore, a better
classifier can be achieved by integrating prior knowledge from other scales,
or at least from the next coarser scale.

4.5. Multi-scale Maximum a Posteriori

In the previous section, only a "raw" binary classification between two
states was realized under the wavelet Hidden Markov Tree model [30]. Given
prior knowledge from other scales, a better binary classification for DIS
and MDIS, in equation (2), needs the posterior probability
$p(c_i^j \mid d_i^j)$, where $c_i^j$ and $d_i^j$ are the class label and the
features of the image at dyadic scale j and location i.

In order to estimate the MAP $p(c_i^j \mid d_i^j)$, we need a Bayesian
approach that captures the dependencies between dyadic squares at different
scales. Although many approximation techniques [36], [37], [SB], [21] have
been derived for a practical MAP, the HMT-based approach of Choi [40] has
proven to be a feasible solution. Choi [40] introduces hidden label tree
modelling instead of joint probability estimation over the high-dimensional
data of dyadic squares. Owing to the strong correlation between the square
under inspection and its parent and the parent's neighbours, the class
labels decided for these adjacent squares affect the decision at the
considered square. For example, if the parent square belongs to a certain
class and so do its neighbours, the child square most likely belongs to the
same class as well.

The parent-child relation can be modelled by a general probabilistic graph
[30]; however, the complexity increases exponentially with the number of
neighbouring nodes. Choi [40] proposes a simpler alternative based on a
context-based Bayesian approach. For the sake of simplicity, causal contexts
are defined only by the states of the direct parent node and its 8 immediate
neighbours. Let us denote the context for $D_i$ as
$\mathbf{v}_i = [v_{i,0}, v_{i,1}, \ldots, v_{i,8}]$, where $v_{i,0}$ refers
to the context from the direct parent node and the remaining entries to that
from its neighbours. The triple $\mathbf{v}_i \to C_i \to D_i$ forms a
Markov-1 chain, relating the prior context $\mathbf{v}_i$ and the node
features $D_i$ to the classification decision $C_i$. Moreover, the class
labels of the prior contexts $\mathbf{v}_i$ are chosen to be discrete
values, as this simplifies the modelling considerably. Given the prior
context, independence can be assumed for the label classification at each
node; therefore, we can write

$$p(\mathbf{c}^j \mid \mathbf{v}^j) = \prod_i p(c_i^j \mid \mathbf{v}_i^j)$$

The Markov-1 chain property assumes that $D_i$ is independent of
$\mathbf{v}_i$ given $C_i$; therefore, the posterior probability of the
labels $\mathbf{c}^j$ given $\mathbf{d}^j$ and $\mathbf{v}^j$ is written as
follows.

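$$p(\mathbf{c}^j \mid \mathbf{d}^j, \mathbf{v}^j) = \frac{f(\mathbf{d}^j \mid \mathbf{c}^j)\, p(\mathbf{c}^j \mid \mathbf{v}^j)}{f(\mathbf{d}^j \mid \mathbf{v}^j)}$$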

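As independence is assumed for the label decisions in the classification
process, this yields

$$p(\mathbf{c}^j \mid \mathbf{d}^j, \mathbf{v}^j) = \frac{1}{f(\mathbf{d}^j \mid \mathbf{v}^j)} \prod_i f(d_i^j \mid c_i^j)\, p(c_i^j \mid \mathbf{v}_i^j)$$

and the marginalized context-based posterior

$$p(c_i^j \mid d_i^j, \mathbf{v}_i^j) \propto f(d_i^j \mid c_i^j)\, p(c_i^j \mid \mathbf{v}_i^j)$$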

This greatly simplifies the MAP estimation, since it no longer needs to deal
with the joint prior conditions of features and contexts. It only needs two
separate quantities: the likelihood of the dyadic square given the class
value $c_i^j$, $f(d_i^j \mid c_i^j)$, and the prior provided through the
context $\mathbf{v}_i$, $p(c_i^j \mid \mathbf{v}_i^j)$.
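For two classes, the proportionality above can be normalized explicitly. The following minimal sketch assumes the two class likelihoods and the context prior for class 1 are already known for each block; the function and argument names are illustrative.

```python
import numpy as np

def context_posterior(lik0, lik1, prior1_given_context):
    """Posterior p(c = 1 | d, v) from two likelihoods and a context prior.

    lik0, lik1            : f(d | c = 0) and f(d | c = 1) for each block.
    prior1_given_context  : p(c = 1 | v) for each block's context.
    """
    lik0 = np.asarray(lik0, dtype=float)
    lik1 = np.asarray(lik1, dtype=float)
    p1 = np.asarray(prior1_given_context, dtype=float)
    num = lik1 * p1
    den = num + lik0 * (1.0 - p1)
    return num / den
```

When the two likelihoods are equal the posterior simply returns the context prior, which is how the coarser scale influences an otherwise undecided block.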

While retrieving the likelihood $f(d_i^j \mid c_i^j)$ is straightforward by
the up-sweep with given HMT model parameters at each scale, the complexity
of estimating the prior context depends greatly on its structure. Although
more general structures may give better prior information for
classification, they also greatly complicate the modelling and the
summarizing of the information conveyed by $\mathbf{v}_i$. In other words,
we run the risk of context dilution, especially when training data are
insufficient [36], [37], [30].

To simplify the prior information while preserving its generality, we employ
a simple context structure inspired by the hybrid tree model [2S] for
context-labelling trees. Instead of including all neighbouring sub-squares,
the simplified context involves only the label of the parent square
$C_{\rho(i)}$ and the majority vote of the class labels of the neighbouring
squares $C_{\mathcal{N}_i}$. As there are only $N_c = 2$ class labels, 0 and
1, the prior context $\mathbf{v}_i := \{C_{\rho(i)}, C_{\mathcal{N}_i}\}$
can only take $N_c^2 = 4$ different values: (0, 0), (0, 1), (1, 0), (1, 1).
Despite being an ad hoc, simple contextual model, it provides sufficient
statistics for demonstrating the effectiveness of multi-scale decision
fusion [39]. Another advantage of the simplified context structure is that
it does not require an enormous amount of training data for probability
estimation.
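The four-valued context can be computed from the label map of the coarser scale as sketched below. Several details are assumptions made only for illustration: the vote is taken over the parent's 8 neighbours, image borders are handled by edge replication, a 4-4 tie counts as label 1, and the pair (parent, majority) is encoded as the integer 2 * parent + majority.

```python
import numpy as np

def context_index(parent_labels):
    """Context value 2 * parent + majority for every child block.

    parent_labels : (n, n) array of 0/1 labels at the coarser scale.
    Returns a (2n, 2n) array with values in {0, 1, 2, 3}.
    Assumptions: the vote is over the parent's 8 neighbours, borders are
    handled by edge replication, and a 4-4 tie counts as label 1.
    """
    labels = np.asarray(parent_labels, dtype=int)
    n_rows, n_cols = labels.shape
    padded = np.pad(labels, 1, mode="edge")
    # Sum of the 3x3 neighbourhood, then remove the centre (the parent itself).
    total = sum(padded[r:r + n_rows, c:c + n_cols]
                for r in range(3) for c in range(3))
    votes = total - labels
    majority = (votes >= 4).astype(int)
    context = 2 * labels + majority
    # Each parent passes its context to its four children.
    return np.kron(context, np.ones((2, 2), dtype=int))
```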

Any decision about the labels at scale j depends on prior information about
the labels at scale j - 1; therefore, we can maximize the posterior in (7)
in a multi-scale, coarse-to-fine manner by fusing the likelihoods
$f(d_i^j \mid c_i^j)$ with the label tree prior
$p(c_i^j \mid \mathbf{v}_i^j)$. The fusion step passes the MAP estimate down
through the scales, which enhances the coherence between the classification
results of consecutive scales. In this way, the posterior probability of a
class label given the features and the prior context is computed and
maximized coherently across multiple scales.

$$\hat{c}_i^{MAP} = \arg\max_{c_i^j \in \{0, 1\}} p(c_i^j \mid d_i^j, \mathbf{v}_i^j) \qquad (7)$$
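The coarse-to-fine fusion can be sketched as a loop over scales in which the labels decided at one scale provide the context for the next. This is only an illustration under stated assumptions: the per-scale likelihoods are given, the context prior is a hypothetical length-4 table `prior1`, the coarsest scale uses a flat prior of 0.5, and the context encoding (2 * parent + majority over 8 neighbours, edge replication, ties to 1) is an arbitrary choice.

```python
import numpy as np

def parent_context(labels):
    """Context 2 * parent + majority of the parent's 8 neighbours, per child."""
    labels = np.asarray(labels, dtype=int)
    rows, cols = labels.shape
    padded = np.pad(labels, 1, mode="edge")
    total = sum(padded[r:r + rows, c:c + cols]
                for r in range(3) for c in range(3))
    majority = ((total - labels) >= 4).astype(int)
    return np.kron(2 * labels + majority, np.ones((2, 2), dtype=int))

def coarse_to_fine_map(lik0, lik1, prior1):
    """Fuse per-scale likelihoods with a context prior, coarse to fine.

    lik0[j], lik1[j] : class likelihoods at scale j, side doubling per scale.
    prior1           : length-4 table, prior1[v] = p(c = 1 | context v).
    Returns the list of MAP label maps and of posteriors p(c = 1 | d, v).
    """
    prior1 = np.asarray(prior1, dtype=float)
    labels, posteriors = [], []
    for j, (l0, l1) in enumerate(zip(lik0, lik1)):
        l0 = np.asarray(l0, dtype=float)
        l1 = np.asarray(l1, dtype=float)
        if j == 0:
            p1 = np.full(l1.shape, 0.5)          # no context at the coarsest scale
        else:
            p1 = prior1[parent_context(labels[-1])]
        post = l1 * p1 / (l1 * p1 + l0 * (1.0 - p1))
        posteriors.append(post)
        labels.append((post > 0.5).astype(int))
    return labels, posteriors
```

With uninformative likelihoods at a fine scale the labels follow the parent, while a sufficiently strong likelihood can still override the context.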

4.6. Multiscale Discriminant Saliency

The core idea of DIS and MDIS is to measure the discriminant power between
two classes, centre and surround. Though it can be estimated by sample means
of mutual information, the underlying mechanism is distinguishing the centre
and surround classes given the Generalized Gaussian Distribution (GGD) of
wavelet coefficients. These distributions are usually zero-mean and well
characterized by their variance parameters alone. Gao [13] estimates the
scale parameter (variance) of the GGD (see Section 2.4 of [13] for more
details) by a maximum a posteriori (MAP) process.

$$\hat{\alpha}^{\beta}_{MAP} = \frac{1}{\kappa} \left( \sum_{j=1}^{n} |x(j)|^{\beta} + \nu \right)$$

This MAP estimate is later used to decide whether a sample point, or an
image data point, belongs to the centre or the surround class (see [13] for
a detailed proof and explanation). The more the MAP estimate of the centre
class's variance parameter $\alpha_1$ differs from that of the surround
class's variance parameter $\alpha_0$, the more discriminant power there is
for separating the hypothesis of interest from the null hypothesis.

Besides the GGD, the Gaussian Mixture Model (GSM) is a popular choice for
modelling wavelet distributions with multiple class variances [33], [34].
In a binary classification problem with only two classes, the GSM consists
of two Gaussian Distribution (GD) components with distinct variances, which
are named the "large" and "small" states according to their variance values.
The only difference between the GSM and Gao's proposal [13] is then whether
the GD or the GGD is used. Though the GGD is more sophisticated, with a
customizable shape parameter $\beta$, several factors support the validity
of simple GD modelling given the class conditions as hidden variables.
Empirical results have shown that the mixture model is simple yet effective
[33], [39]. Modelling wavelet coefficients with hidden classes of "large"
and "small" variance states is the basic data model in the wavelet-based
Hidden Markov Tree (HMT) [30]. With the wavelet HMT, image data are
processed in a coarse-to-fine, multi-scale manner; therefore, the MAP of a
state $C_i^j$ given input features from a sub-square $D_i^j$ can be
estimated inherently across scales j = 0, 1, ..., J. More details about this
multi-scale MAP estimation by the wavelet HMT can be found in the previous
sections. Then, combining the MAP estimation in equation (7) with the mutual
information computation in equation (2) yields the mathematical formulation
of MDIS.

where $H(C^j) = -\sum_{c} p(C^j = c) \log p(C^j = c)$ is the entropy of the
classes at scale j, and the posterior probability is estimated by modelling
wavelet coefficients in the HMT framework. This has been discussed in the
previous sections and is not repeated here. As equation (8) yields the
discriminant power at multiple scales, a strategy is needed for combining
them across scales. In this paper, a simple maximum rule is applied to
select the discriminant values from multiple scales into a single MDIS
saliency map at each sub-square $d_i$.

$$I_l(C; D) = \max_j \left( I_l^j(C^j; D^j) \right) \qquad (9)$$
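The maximum rule of equation (9) can be illustrated as a pixel-wise maximum over per-scale maps brought to a common resolution. The sketch assumes square maps whose side lengths divide that of the finest map and uses block replication for upsampling; both choices and the function name are illustrative.

```python
import numpy as np

def fuse_max(scale_maps):
    """Pixel-wise maximum of per-scale saliency maps.

    scale_maps : list of square arrays ordered coarse to fine, each side
                 length dividing that of the finest map.
    Coarser maps are upsampled by block replication before the maximum.
    """
    finest = np.asarray(scale_maps[-1], dtype=float).shape[0]
    fused = np.full((finest, finest), -np.inf)
    for m in scale_maps:
        m = np.asarray(m, dtype=float)
        factor = finest // m.shape[0]
        upsampled = np.kron(m, np.ones((factor, factor)))
        fused = np.maximum(fused, upsampled)
    return fused
```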

5. Experiments & Discussion 

In the light of decision theory, a saliency map can be considered a binary
filter applied at each image location: according to a certain saliency
threshold, each spot is labelled as interesting or uninteresting. However,
treating a binary classification map as the saliency map that drives human
visual performance would significantly undervalue the human visual attention
system. Psychological experiments show that biologically plausible visual
attention has a much greater capacity than binary classification maps.
Therefore, our proposed method uses binary classification between centre and
surround only as an intermediate step towards an information-based saliency
map. At each location, the mutual information between the distributions of
classes and features measures the discriminant power between the interesting
and non-interesting classes given the input dyadic sub-squares. Instead of
discrete binary values, the saliency representation thus takes a continuous
range of discriminant power, and the generated saliency maps are expected to
correlate better with human visual attention maps.

Besides a suitable saliency representation, reliable ground-truth data are
necessary for evaluation. As the purpose of our research is to deepen the
understanding of how multi-scale discriminant saliency relates to human
visual attention, the ground-truth data must come from psychological
experiments in which human subjects look at different natural scenes and
their responses are collected. Moreover, the research scope is restricted to
bottom-up visual saliency, the early stage of attention without interference
from prior knowledge and experience. Human participants should therefore be
naive about the aims of the experiments and should not know the contents of
the displayed scenes in advance. Once these prerequisites are satisfied,
human responses to each scene can be collected accurately with eye-tracking
equipment. It records a collection of eye fixations for each scene, and
these raw data are the basic form of ground truth for evaluating saliency
methods.

Assuming that ground-truth data are available, quantitative methods can be
applied to evaluate the MDIS saliency results. In an effort to standardize
the evaluation process, we use only one of the most common and accessible
databases and sets of evaluation tools in the visual attention field. With
regard to available databases and ground truths, Neil Bruce's database [9]
is among the most popular datasets used in information-based saliency
studies. While proposing his InfoMax (AIM) approach, the first
information-based visual saliency method, he also released his test
database. It is a reasonably small collection of 120 different colour images
viewed by 20 different subjects. Each subject observed the displayed images
in random order on a 12 inch CRT monitor positioned 0.75 m from their
location, for 4 seconds each, with a mask between each pair of images.
Importantly, no particular instructions were given except to observe the
images.

The brief description above establishes the validity of this database for
our experiments. In addition, the AIM method serves as the reference method
against which we compare our proposed saliency solution, MDIS, in terms of
performance, computational load, etc. Though DIS [13] is the closest
approach to our proposed MDIS, no implementation from its author is
available for comparison. AIM also derives saliency values from information
theory, with a slightly different computation: self-information instead of
the mutual information used in MDIS or DIS. Therefore, it is considered the
second-best choice of reference method for our later evaluation of MDIS.

With a valid database in place, proper numerical tools are necessary for
analysing the simulation data. For fairness and accuracy of the evaluation,
we employ a set of three measures, LCC, NSS and AUC, recommended by Ali
Borji et al. jl3], since the evaluation code can be retrieved freely from
their website (see footnote 2). Assessment with three evaluation scores
ensures the reliability of the observations, so that any conclusion does not
depend on the choice of metric. First, the linear correlation coefficient
(LCC) measures the linear relationship between two variables,
$CC(G, S) = \frac{\mathrm{cov}(G, S)}{\sigma_G \sigma_S}$, where $\sigma_G$
and $\sigma_S$ are the standard deviations of the ground-truth map G and the
saliency map S. Values of LCC vary from -1 to +1 as the correlation changes
from a total inverse to a perfect linear relation between the G and S maps.

2 https://sites.google.com/site/saliencyevaluation/

While LCC measures the similarity of saliency and fixation maps as a whole,
the normalized scan-path saliency (NSS) treats eye fixations as random
variables that are classified by the proposed saliency maps. The measure is
the average of the normalized saliency values at human eye positions for
each approach. Here NSS values range from 0 to 1: NSS = 1 indicates that the
saliency values at eye-fixation locations are one standard deviation above
average, whereas NSS = 0 means that the saliency map performs no better than
a randomly generated map.
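Both measures are short to compute, as the following minimal sketch shows. It is not the evaluation code referred to in the text; the function names, the boolean fixation mask and the use of the population standard deviation are assumptions for illustration.

```python
import numpy as np

def lcc(ground_truth, saliency):
    """Linear correlation coefficient cov(G, S) / (std(G) * std(S))."""
    g = np.asarray(ground_truth, dtype=float).ravel()
    s = np.asarray(saliency, dtype=float).ravel()
    g = g - g.mean()
    s = s - s.mean()
    return float((g * s).mean() / (g.std() * s.std()))

def nss(saliency, fixations):
    """Mean of the z-scored saliency values at fixated pixels.

    fixations : boolean array, True where a fixation landed.
    """
    s = np.asarray(saliency, dtype=float)
    z = (s - s.mean()) / s.std()
    return float(z[np.asarray(fixations, dtype=bool)].mean())
```

Because the map is z-scored first, an NSS of 1 corresponds directly to fixated pixels lying one standard deviation above the map average.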

Unlike the previous two measures, which deal with saliency maps directly,
the last quantitative tool, AUC, uses saliency maps as binary classification
filters with various thresholds. By varying the threshold regularly over the
range of saliency values, we obtain a number of binary classification
filters. Applying these filters to the eye-fixation maps produces pairs of
true positive rate (TPR) and false positive rate (FPR), which are the
vertical and horizontal coordinates of the Receiver Operating Characteristic
(ROC) curve. The area under the curve (AUC) is a simple quantitative measure
for comparing the ROC curves of different saliency approaches. A perfect
prediction has an AUC of 1, while 0.5 indicates chance-level performance. In
addition to ROC, ISROC plots [7] are useful for comparing the performance of
the evaluated approaches against that of multiple human subjects.
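The threshold sweep can be sketched as below. Treating every non-fixated pixel as a negative, the number of thresholds and the trapezoidal integration are assumptions of this illustration; published AUC variants differ in how negatives are sampled.

```python
import numpy as np

def roc_auc(saliency, fixations, num_thresholds=100):
    """ROC points and area under the curve by sweeping a saliency threshold.

    Fixated pixels are positives, all other pixels negatives. The number of
    thresholds and the trapezoidal integration are illustrative choices.
    """
    s = np.asarray(saliency, dtype=float).ravel()
    pos = np.asarray(fixations, dtype=bool).ravel()
    thresholds = np.linspace(s.max(), s.min(), num_thresholds)
    tpr = np.array([(s[pos] >= t).mean() for t in thresholds])
    fpr = np.array([(s[~pos] >= t).mean() for t in thresholds])
    # Start the curve at (0, 0); the last threshold already gives (1, 1).
    tpr = np.concatenate([[0.0], tpr])
    fpr = np.concatenate([[0.0], fpr])
    area = float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))
    return fpr, tpr, area
```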

As mentioned previously, AIM is chosen as the reference information-based
saliency method, owing to its well-established reputation as well as its
freely accessible code and experiment database (see footnote 3). Owing to
the multi-scale nature of the proposed MDIS, saliency maps can be extracted
for each dyadic scale level as well as the final MDIS saliency map,
integrated across scales by equation (9). The availability of saliency maps
at multiple scales, together with the combined one, allows the discriminant
power concept for saliency to be evaluated scale by scale or as a whole. We
denote the integrated MDIS map as HMT0, while the separate saliency maps are
named HMT1 to HMT5 in coarse-to-fine order. By examining MDIS from these
different aspects, we can observe its effectiveness in predicting
eye-fixation points and how the selection of the classification window size
affects its performance. In addition, comparisons against AIM give a general
view of how MDIS performs against a well-known information-based saliency
method.



3 http://www-sop.inria.fr/members/Neil.Bruce/

In our proposed methods, the Hidden Markov Tree (HMT) plays the role of
modelling the statistical properties of images. It extracts model parameters
at each scale by considering the distribution of features given hidden
variables (centre-surround labels in our method). Therefore, training is a
necessary step before the model parameters can be approximated in THMT.
Normally, different wavelet sub-bands are trained independently on the
assumption that no correlation exists between them. However, HMT parameters
can also be acquired by grouping wavelet coefficients and training on these
multivariate variables, as in VHMT models. Furthermore, training is not
needed at all if the input images all come from a specific category, for
example, natural images in this paper. Romberg et al. [21] have studied this
Universal Hidden Markov Tree (UHMT) for the natural image class. In other
words, model parameters can be fixed without any training effort, which
greatly reduces the computational load of MDIS. Each configuration of the
HMT model has its own advantages and disadvantages; therefore, a simulation
is necessary to compare the performance of (U/T/V)HMTs in terms of LCC, NSS,
AUC and TIME (the computational time). Note that we write (U/T/V)HMT instead
of HMT when presenting experimental data, in order to indicate which
tree-building method is employed.

Quantitative methods give a general idea of how the proposed algorithms
perform on average. However, such evaluation lacks specific details about
successful and failed cases, since this information is averaged out. To look
for the pros and cons of the algorithm, we also perform qualitative
evaluations of the saliency maps generated by MDIS at multiple scales.
Furthermore, AIM saliency maps are generated and compared with the MDIS maps
to discover the advantages and disadvantages of each method.

5.1. Quantitative Evaluation 

After the general review of how the simulations are built and evaluated in
the previous section, this section presents and analyses the data of the
conducted experiments. In this paper, five dyadic scales are deployed for
every HMT training and evaluation; therefore, we have the simulation modes
(U/T/V)HMT(1-5) of MDIS, depending on whether a training stage is deployed,
(T/V)HMT, or universal parameters are used, UHMT. Saliency maps can be
combined according to the maximization of mutual information rule, equation
(9); therefore, we also have three (U/T/V)HMT0 modes for the saliency maps
created by across-scale integration. AIM is included in the simulations as
the reference method, while LCC, NSS, AUC and TIME are chosen as the
numerical evaluation tools. The simulation results are given in the tables
below: Table 1 shows the experimental data of all universal HMT modes, Table
2 summarizes the data of all trained HMT modes, and Table 3 those of all
vectorized HMT modes.



Table 1: UHMT - MDIS - DATA

Observations    LCC        NSS        AUC        TIME
UHMT0            0.01434    0.21811    0.89392     0.39617
UHMT1           -0.00269    0.19772    0.53862     0.39617
UHMT2            0.01294    0.27819    0.60520     0.39617
UHMT3            0.01349    0.32868    0.69065     0.39617
UHMT4            0.01604    0.42419    0.83615     0.39617
UHMT5            0.00548    0.13273    0.89234     0.39706
AIM              0.01576    0.12378    0.72275    50.41714



Table 2: THMT - MDIS - DATA

Observations        LCC        NSS        AUC        TIME
THMT0           0.02382    0.48019    0.88357     2.32734
THMT1           0.02582    0.38096    0.60922     2.32734
THMT2           0.01156    0.31855    0.64633     2.32726
THMT3           0.01604    0.32491    0.71972     2.32726
THMT4           0.01143    0.29662    0.81192     2.32726
THMT5           0.00512    0.36932    0.89532     2.32726
AIM             0.01576    0.12378    0.72353    50.41714



Table 3: VHMT - MDIS - DATA

Observations        LCC        NSS        AUC        TIME
VHMT0           0.01697    0.44170    0.86606     2.84212
VHMT1           0.01693    0.38387    0.61187     2.84212
VHMT2           0.02044    0.38777    0.67060     2.84212
VHMT3           0.01430    0.38882    0.73682     2.84212
VHMT4           0.00946    0.36761    0.82329     2.84212
VHMT5          -0.00125    0.39580    0.88160     2.84212
AIM             0.01576    0.12378    0.72400    50.41714



Figure 3: (U/T/V)HMT - MDIS - (IS)ROC. Panels: (a) UHMT - MDIS - ROC,
(b) UHMT - MDIS - ISROC, (c) THMT - MDIS - ROC, (d) THMT - MDIS - ISROC,
(e) VHMT - MDIS - ROC, (f) VHMT - MDIS - ISROC.

Among numerical evaluation tools for visual saliency, the Receiver
Operating Characteristic (ROC) curve and its Area Under Curve (AUC) are
the most popular. They measure the efficiency of saliency maps in
classifying fixation and non-fixation points of human eye movements in
visual psychological experiments. In a ROC curve, the vertical axis
indicates the True Positive Rate (TPR) of the classification, which is
equal to the hit rate or recall: the ratio between the number of
correctly classified fixation points and the total number of fixation
points. Meanwhile, the False Positive Rate (FPR) on the horizontal axis
is equivalent to the fall-out ratio: the number of non-fixation points
incorrectly classified as fixation points over the total number of
non-fixation points. In brief, ROC curves represent the success rate
(TPR) of guessing eye-fixation points at a particular fall-out rate (FPR)
over a normalized continuous range of thresholds from 0 to 1. When the
threshold is raised to the maximum saliency value, no points are marked
as interesting; both hit and fall-out rates are 0, (FPR, TPR) = (0, 0).
When the threshold is lowered gradually, the recall and fall-out rates
increase bit by bit as well. If the recall increases faster than the
fall-out rate, the binary classifier performs well, since it tends to
make more correct than wrong decisions about which locations are
interesting. Otherwise, it is unreliable in the choice of salient points.
When the minimum threshold is reached, both recall and fall-out rates
reach their maximum, (FPR, TPR) = (1, 1), the case in which every point
is marked as interesting.
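
The threshold sweep described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not the
evaluation code used for the reported results: the function names
roc_points and auc, the number of thresholds, and the toy 2x2 map are
all hypothetical, and the sketch assumes at least one fixation and one
non-fixation pixel.

```python
import numpy as np

def roc_points(saliency, fixations, n_thresholds=101):
    """Return (fpr, tpr) arrays obtained by sweeping a threshold from high to low."""
    s = np.asarray(saliency, dtype=float)
    fix = np.asarray(fixations, dtype=bool)
    span = s.max() - s.min()
    # Normalise saliency to [0, 1]; a constant map carries no ranking information.
    s = (s - s.min()) / span if span > 0 else np.zeros_like(s)
    pos, neg = s[fix], s[~fix]  # saliency at fixation / non-fixation pixels
    # Start above the maximum so the first point is (0, 0); end at 0 so the last is (1, 1).
    thresholds = np.concatenate(([np.inf], np.linspace(1.0, 0.0, n_thresholds)))
    tpr = np.array([(pos >= t).mean() for t in thresholds])
    fpr = np.array([(neg >= t).mean() for t in thresholds])
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy 2x2 example: the two fixated pixels carry the two largest saliency values.
saliency = np.array([[0.0, 0.2], [0.8, 1.0]])
fixations = np.array([[False, False], [True, True]])
fpr, tpr = roc_points(saliency, fixations)
print(auc(fpr, tpr))
```

In the toy example the fixated pixels are perfectly separated from the
others, so the curve passes through the top-left corner; reversing the
saliency values would push the curve to the bottom-right corner instead.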

Figures 3(a), 3(c), and 3(e) display ROC curves of UHMT, THMT, and VHMT
for several scales, with AIM as the reference curve. In all figures,
solid green lines represent the ROC of AIM, the reference method, while
blue, red, and orange colours represent the ROCs of UHMT, THMT, and VHMT
respectively. In general, AIM has an average performance in comparison
with the HMT modes at different scales and window sizes. It performs
better than all HMT modes when large 16x16 or 32x32 windows are utilized.
When 8x8, 4x4, and 2x2 blocks are employed, HMTs surpass AIM in the
detection of fixation points. In other words, HMT at these scales
produces more meaningful maps than AIM does. Note that AIM utilizes 30x30
windows, almost equivalent to the window size of (U/T/V)HMT1; however,
AIM generates overlapping squares instead of distinct blocks. Utilization
of such large windows and high-dimensional data requires training steps
and sample preparation, which demand enormous computational effort. The
application of machine learning to continuously sliding windows explains
the superiority of AIM over (U/T/V)HMT with similar block sizes. However,
the ROC curves of HMTs rapidly outperform AIM if we shrink block sizes
from HMT1 to HMT5, in other words, increase the resolution of the output
saliency maps.

In comparisons among (U/T/V)HMTs, we can see advantages of (T/V)HMT over
UHMT at different scales. This is clearly shown in the three
corresponding ROC figures, Figures 3(a), 3(c), and 3(e), since the curves
of (T/V)HMT move closer toward the top-left corner than those of UHMT do.
It means that (T/V)HMT detect meaningful points more successfully than
UHMT does in the region of low fall-out rate (FPR). Therefore, (T/V)HMT
provide a more robust binary classifier. This is reasonable since THMT
and VHMT require training steps on image data, whereas the UHMT model
just uses predefined, general parameters. Furthermore, VHMT has slightly
better performance than THMT at most window sizes. The slight improvement
in the ROC curves is due to the fact that VHMT is a better model than
THMT for processing texture features.

Generalized ROC evaluates the performance of saliency maps at different
thresholds against ground-truth eye-fixation data of several human test
subjects. However, it does not distinguish the eye-fixation data of each
subject but treats the data collectively. Therefore, it loses important
aspects of eye-tracking data from multiple subjects, such as the
diversity of individual responses over the various types of scenes shown
during experiments. For example, a simple scene with only one simple
subject on a blank white background would cause similar eye movements in
all subjects. However, the responses are rather divergent when scenes
with two or more foreground subjects on complex backgrounds are shown to
participants. To visualize the diversity of subjects' responses and the
corresponding performance of saliency maps, we utilize an evaluation
method similar to the Inter-Subject ROC (ISROC) approach of Harel et al.
[7], in which saliency methods are evaluated against the diversity in
subjects' responses. The horizontal axis, inter-subject AUC, measures on
average how closely the eye-fixation map of one subject resembles the map
generated from the eye-tracking data of the other subjects. If the value
is small, there is not much consistency across subjects' responses, due
to complex scenes. Otherwise, it shows that simple scenes result in
agreement of the responses from most subjects.
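
A leave-one-out version of the inter-subject score can be sketched as
follows. This is a simplified illustration and not the exact procedure
of [7]: the raw fixation counts of the other subjects are used directly
as the predicting map (no smoothing), the function names pairwise_auc
and inter_subject_auc are hypothetical, and every subject is assumed to
have at least one fixated and one non-fixated pixel.

```python
import numpy as np

def pairwise_auc(scores_pos, scores_neg):
    """Probability that a positive outscores a negative (ties count one half)."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())

def inter_subject_auc(fixation_maps):
    """Average leave-one-out AUC over a stack of boolean fixation maps (one per subject)."""
    maps = np.asarray(fixation_maps, dtype=bool)
    scores = []
    for k in range(len(maps)):
        # Fixation counts of all other subjects act as the "saliency" map.
        others = np.delete(maps, k, axis=0).sum(axis=0).astype(float)
        held_out = maps[k]
        scores.append(pairwise_auc(others[held_out], others[~held_out]))
    return float(np.mean(scores))

# Simple scene: three subjects fixate the same pixel of a 3x3 image.
agree = np.zeros((3, 3, 3), dtype=bool)
agree[:, 1, 1] = True

# Complex scene: each subject fixates a different pixel.
diverge = np.zeros((3, 3, 3), dtype=bool)
for k in range(3):
    diverge[k, k, k] = True

print(inter_subject_auc(agree), inter_subject_auc(diverge))
```

The first stack mimics a simple scene on which all subjects agree and
yields the highest possible score, while the second mimics a complex
scene with divergent responses and yields a score below chance level.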

Applying the ISROC evaluation method to the proposed HMTs and the
reference AIM saliency method generates Figures 3(b), 3(d), and 3(f) for
the (U) universal, (T) trained, and (V) vector-based configurations.
Generally, all saliency methods have quite consistent performance across
scenes of different complexity. In other words, they can produce
meaningful saliency maps in complex cases where inter-subject scores are
low, that is, where human subjects struggle to find common salient
points. However, these computational saliency maps are overcomplicated
for simple scenes with high inter-subject ROC scores. In these cases, all
subjects focus on a few locations in the test scenes, while the proposed
saliency methods still detect other regions of interest besides the main
objects. Therefore, the proposed computational approaches are less
efficient than human beings in such situations.

Similar to ROC curves, ISROC curves help to compare HMTs with different
configurations at multiple levels as well. Figure 3 shows the advantages
of HMTs in terms of ISROC curves when smaller block sizes are chosen.
Moreover, the performance of the reference method AIM, the solid green
line, is better than HMT2 (16x16 blocks) and HMT1 (32x32 blocks),
equivalent to HMT3 (8x8 blocks), but worse than HMT0, HMT4 (4x4 blocks),
and HMT5 (2x2 blocks). HMT0, HMT4 and HMT5 have better performance than
human subjects in most situations; their ISROC curves are above the black
boundary set by human subjects for most of the plots. An interesting
observation about HMT0, the integrated map, and HMT5, with 2x2 blocks, is
that HMT5 is almost equivalent to or sometimes better than the integrated
method HMT0. Perhaps more complex rules need to be developed to integrate
attention maps of different scales efficiently.

The shapes of ROC and ISROC curves give the general idea that HMTs
outperform the reference AIM map; however, they do not specify how much
better the proposed methods are. Hence, a numerical analysis of their
performance is necessary. Computational load is the first measure to be
analysed and compared among saliency approaches. The TIME entries of
Tables 1, 2, and 3 present the processing time needed for each method or
mode. Generally, the computational loads, proportional to processing
time, of all modes within either the UHMT or the THMT table are almost
identical, since the parameters of the full-depth Hidden Markov Tree need
estimating before the computation of saliency values. Comparing
(U/T/V)HMTs in terms of processing time, UHMT is faster than (T/V)HMT as
UHMT uses predefined parameters instead of learning HMT parameters from
each image. Comparing the (U/T/V)HMT modes of MDIS with AIM, our proposed
methods are much faster than AIM. The well-known AIM directly estimates
self-information from high-dimensional data by an ICA algorithm, while
MDIS statistically models two hidden states, "large" and "small", in
sparse and structural features. The computational load or processing time
of AIM and of the proposed MDIS in its different modes can be seen in
Figure 4(d).

Though HMT-MDIS significantly reduces the computational load of
information-based saliency, its accuracy needs further verification. We
begin by evaluating the three modes (U/T/V)HMT separately against AIM in
terms of the three numerical tools LCC, NSS, and AUC together, in Figures
4(a), 4(b), and 4(c). Then all HMT modes are summarized in three plots
(the top row of Figure 5) in the following order: NSS, LCC, AUC from left
to right. In Figure 5, simulation modes of the same scale level are
placed next to each other; for example, (U/T/V)HMT0 sit next to each
other, and so do (U/T/V)HMT1, etc. They are intentionally arranged in
that way to compare the performance of different simulation modes at the
same scale level. Note that in both Tables 1 and 2 the maximum and
minimum values of each measurement are identified by corresponding text
styles. The identification of extreme values only involves the
(U/T/V)HMT modes of MDIS. In Figures 4(a), 4(b), and 4(c), extreme
(maximum or minimum) values are also specially marked: maximum values
have big solid markers, while big but empty ones represent minimum
points. In addition, AIM has big markers with a distinguishing large
checkerboard texture, while the integrated saliency modes (U/T/V)HMT0
have small checkerboard textures. These special markers help highlight
interesting facts in comparisons among HMT-MDIS modes or of MDIS against
AIM. The same marking policy is applied to data representation in Figure
5. Meanwhile, each line in that figure has an arrowhead showing the trend
of the experimental data (increasing/decreasing) when simulation modes
are changed across the U, T, or V configurations of HMT at each scale
level.

Figure 4: Performance of UHMT-MDIS, THMT-MDIS in AUC, NSS, LCC and TIME




Figure 5: Summary of all MDIS against AIM (modes on the horizontal axis:
(U/T/V)HMT0 to (U/T/V)HMT5, followed by AIM)
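
For readers who want to reproduce the two remaining accuracy scores, the
following sketch uses the commonly adopted definitions: NSS as the mean
of the z-scored saliency map at fixated pixels, and LCC as the Pearson
correlation between the saliency map and a fixation density map. Whether
the reported experiments use exactly these variants is an assumption;
the function names nss and lcc and the toy arrays are hypothetical.

```python
import numpy as np

def nss(saliency, fixations):
    """Normalized scanpath saliency: mean z-scored saliency at fixated pixels."""
    s = np.asarray(saliency, dtype=float)
    std = s.std()
    if std == 0:
        return 0.0  # a constant map gives no evidence for or against any location
    z = (s - s.mean()) / std
    return float(z[np.asarray(fixations, dtype=bool)].mean())

def lcc(saliency, fixation_density):
    """Linear correlation coefficient between two maps of the same shape."""
    a = np.asarray(saliency, dtype=float).ravel()
    b = np.asarray(fixation_density, dtype=float).ravel()
    return float(np.corrcoef(a, b)[0, 1])

# Toy example: one salient pixel that coincides with the single fixation.
saliency = np.array([[0.0, 0.0], [0.0, 1.0]])
fixations = np.array([[False, False], [False, True]])
print(nss(saliency, fixations))
print(lcc(saliency, fixations.astype(float)))
```

Both scores grow when salient pixels coincide with fixations. NSS is
expressed in standard deviations of the saliency map, while LCC is
bounded between -1 and 1.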

Firstly, the MDIS approach with universal parameters for each level of
the hidden Markov tree is analysed in terms of accuracy, since it
requires very little effort in saliency computation. Naturally, that fact
raises a question about the accuracy of its centre and surround
classifier as well as of the synthesized saliency maps. According to the
simulation data in Table 1, UHMT performs well against AIM in all three
measurements LCC, NSS and AUC. For example, MDIS in UHMT4 mode (4x4
square blocks) surpasses AIM in all measurements. This supports the
validity and efficiency of our proposed methods in the field of
information-based saliency maps. When the performances of different
UHMT-MDIS modes are considered, UHMT4 with 4x4 squares has the most
consistent evaluation among all dyadic scales, with the maximum LCC and
NSS and the second best AUC value. UHMT0-MDIS, the integration of
saliency values across scales, does not perform better than the other
UHMTs except in AUC. This shows the inconsistent side of deploying HMT
with predefined universal parameters, where no training effort is made to
adapt the model to multi-scale statistical structures.
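
The statements about UHMT4 can be checked directly against the numbers
of Table 1. The short script below is only a convenience sketch; the
values are copied from the table, and the variable names are
hypothetical.

```python
# Values copied from Table 1 (UHMT modes and the AIM reference row).
table1 = {
    "UHMT0": {"LCC": 0.01434, "NSS": 0.21811, "AUC": 0.89392, "TIME": 0.39617},
    "UHMT1": {"LCC": -0.00269, "NSS": 0.19772, "AUC": 0.53862, "TIME": 0.39617},
    "UHMT2": {"LCC": 0.01294, "NSS": 0.27819, "AUC": 0.60520, "TIME": 0.39617},
    "UHMT3": {"LCC": 0.01349, "NSS": 0.32868, "AUC": 0.69065, "TIME": 0.39617},
    "UHMT4": {"LCC": 0.01604, "NSS": 0.42419, "AUC": 0.83615, "TIME": 0.39617},
    "UHMT5": {"LCC": 0.00548, "NSS": 0.13273, "AUC": 0.89234, "TIME": 0.39706},
    "AIM": {"LCC": 0.01576, "NSS": 0.12378, "AUC": 0.72275, "TIME": 50.41714},
}
metrics = ("LCC", "NSS", "AUC")
aim = table1["AIM"]

# Modes that beat AIM on every accuracy metric.
beats_aim = [mode for mode, row in table1.items()
             if mode != "AIM" and all(row[k] > aim[k] for k in metrics)]

# Best single-scale mode per metric (integrated UHMT0 and AIM excluded).
single_scale = [f"UHMT{i}" for i in range(1, 6)]
best = {k: max(single_scale, key=lambda m: table1[m][k]) for k in metrics}

# Ratio of AIM processing time to UHMT4 processing time.
speedup = aim["TIME"] / table1["UHMT4"]["TIME"]

print(beats_aim)
print(best)
print(round(speedup, 1))
```

With these values, UHMT4 is the mode that exceeds AIM on LCC, NSS and
AUC simultaneously, it holds the best LCC and NSS among the single-scale
modes, and its TIME entry is more than a hundred times smaller than that
of AIM.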

Secondly, training stages are included in the simulation of MDIS in
(T/V)HMT modes (trained Hidden Markov Trees). With this additional
adaptivity, (T/V)HMT might improve the saliency evaluation and produce
more consistent results than UHMT. This conjecture is supported by the
simulation data in Tables 2 and 3, which are also plotted in Figures 4(b)
and 4(c). As observed in Table 2, THMT0 attains the maximum NSS together
with near-maximum LCC and AUC, and THMT0-MDIS outperforms AIM in all
evaluation schemes. Again, the rationale of MDIS is confirmed in
practice. Furthermore, the effectiveness of training stages is clearly
shown when comparing THMT0 against UHMT0. Though the AUC of THMT0 is
smaller than that of UHMT0, THMT0 is better than its counterpart in both
NSS and LCC. This confirms the usefulness of training Hidden Markov Tree
models for each sample image. In addition, Figures 4(b) and 4(c) show the
strength of the (T/V)HMT0 modes, the across-scale integration modes of
MDIS, over the single-scale saliency maps at different dyadic scales in
most measurements. Note that the LCC of THMT0 mode is a bit smaller than
the LCC of THMT1 mode; however, this small difference can be safely
ignored.

A mode-by-mode comparison of UHMT-MDIS and THMT-MDIS, based on the data
plotted in Figures 4(a) and 4(b), is shown in Figure 5. Accordingly,
there are slight improvements of (T/V)HMT1 and (T/V)HMT2 over UHMT1 and
UHMT2, an equivalence of THMT3 and UHMT3, and a reverse trend in which
UHMT4 and UHMT5 are comparable to or slightly better than THMT4 and
THMT5. It seems that training processes are more important when big
classification windows are used, while the universal approach of HMT
works well when dyadic squares get smaller. Two possible reasons for this
observation are the statistical nature of dyadic squares and the
characteristics of training processes. A bigger square has a richer joint
distribution of features; therefore, UHMT with fixed parameters cannot
approximate that distribution well. However, (T/V)HMT model parameters
can be learned from the analysed images, which results in a significant
improvement in the quality of saliency maps. Smaller sub-squares, in
contrast, are less statistically distinctive, so they are successfully
modelled by the universal parameters of HMT. Training processes might
then become redundant, since UHMT would perform as well as THMT.
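
The shrinking benefit of training at smaller block sizes can be made
concrete with the AUC columns of Tables 1 to 3. The script below is a
sketch for illustration only; the values are copied from the tables, the
block sizes follow the scale convention used in the text (HMT1 with 32x32
blocks down to HMT5 with 2x2 blocks), and the variable names are
hypothetical.

```python
# AUC values copied from Tables 1-3, ordered HMT1 (32x32 blocks) to HMT5 (2x2 blocks).
block_sizes = [32, 16, 8, 4, 2]
auc_u = [0.53862, 0.60520, 0.69065, 0.83615, 0.89234]
auc_t = [0.60922, 0.64633, 0.71972, 0.81192, 0.89532]
auc_v = [0.61187, 0.67060, 0.73682, 0.82329, 0.88160]

# AUC gain of each trained mode over the universal mode at the same scale.
gain_t = [t - u for t, u in zip(auc_t, auc_u)]
gain_v = [v - u for v, u in zip(auc_v, auc_u)]

for size, gt, gv in zip(block_sizes, gain_t, gain_v):
    print(f"{size:2d}x{size:<2d}  THMT-UHMT: {gt:+.5f}  VHMT-UHMT: {gv:+.5f}")
```

In terms of AUC, the gain of the trained modes is largest for the 32x32
blocks, decreases steadily for 16x16 and 8x8 blocks, and turns negative
for 4x4 blocks. The same comparison can be repeated with the NSS or LCC
columns by substituting the corresponding values.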
Figure 6: Saliency Maps 1A



5.2. Qualitative Evaluation 

In this section, saliency maps are analysed qualitatively, that is,
visually. From this analysis, we want to identify (i) in which image
contexts (U/T/V)HMT-MDIS work well, (ii) how scale parameters affect the
formation of saliency maps, and (iii) how MDIS in general compares with
AIM.

In Figures 6 and 7, the first example, with central objects, shows good
(U/V)HMT performance but bad THMT performance. All scale levels of THMT
suppress features of the most obvious objects in the image centre.
Meanwhile, (U/V)HMT4 and (U/V)HMT5 capture significant feature points of
those objects; therefore, (U/V)HMT0 have much better saliency maps than
THMT0. In this case, the best saliency map of the MDIS approach,
(U/V)HMT0, is reasonably competitive against the AIM one. Despite
different configurations, UHMT-MDIS and VHMT-MDIS have quite similar
results for simple scenes with few central objects. Observing Figures
6(a), 7(a) and 6(c), 7(c), we can see differences between the generated
saliency maps of UHMT and VHMT across the six modes [0-5], especially
mode 5. This mode utilizes the smallest 2x2 blocks; therefore, it creates
saliency maps with great detail about the regions in which the (U/V)HMT
maps differ. Generally, UHMT tends to include more irrelevant areas than
VHMT does, while VHMT only focuses on regions rich in edges and textures.
This is coherent with the quantitative comparisons between UHMT and VHMT
in Tables 1 and 3. However, there exist exceptions like the case in
Figure 7, where VHMT misses edges and textures of foreground objects but
wrongly focuses on small black circles in the background. Though the
universal mode does not have on-line training, its predefined models are
learnt by off-line training on many images. Therefore, it performs
robustly in many cases in which the others fail to operate well.
Meanwhile, saliency maps generated by VHMT-MDIS are strongly attracted to
small black dots in the background due to the rotation-invariant nature
of the vector-HMT model [27].

Figure 7: Saliency Maps 1B

Figure 8: Saliency Maps 2A

In contrast to the previous examples, Figures 8 and 9 show an opposite
case of general outdoor scenes, for which (T/V)HMT produce more
reasonable saliency maps. While the UHMT0 map covers the whole region of
sky despite its lack of interesting features, (T/V)HMT0 correctly focus
on interesting objects in the scene. Similarly, (T/V)HMT extract more
meaningful features than UHMT does (see Figures 8 and 9). In addition,
the best saliency map, THMT0 or THMT5, highlights more discriminant
features than the AIM saliency map.
Figure 9: Saliency Maps 2B (panel (c): VHMT)



Notably, there are significant differences between the VHMT1 and THMT1
saliency maps. While THMT1 over-emphasizes edge points and lines between
textures such as the trees and the sky, VHMT1 is riddled with highlighted
regions and does not indicate any outstanding area. Again, this
discrepancy in performance may be due to the nature of orientation
selection in each mode (T/V): THMT favours only features in horizontal,
vertical, and diagonal directions, while the VHMT mode, a
rotation-invariant scheme, does not differentiate between orientations.

The third pair of examples is chosen such that complex scenes are
presented to the saliency methods. In Figure 10, there are several fruits
on a shelf; the scene is considerably complicated due to its richness of
edges, textures, and colours. In general, both (U/T/V)HMT-MDIS and AIM
only partially succeed in detecting salient regions in these images,
since they generally fail to highlight the fruit with a different colour
on the shelf (the fruit inside a red circle, Figure 10). Though most MDIS
modes at the various scale levels do not explicitly detect that fruit,
the UHMT3 and UHMT4 saliency maps are able to highlight its location (see
the UHMT3 and UHMT4 saliency maps, Figure 10). This example matches the
fact that the UHMT4 data in Table 1 show very good performance in all
evaluation schemes. Surprisingly, there are some cases in which
appropriate choices of scales and parameters of predefined HMT models can
outperform all trained HMT models. The interesting example of Figure 10
opens another research direction about how the HMT model can be learned
and optimized; however, that is a question for another research paper.

Figure 10: Saliency Maps 3A

6. Conclusion

In conclusion, multi-scale discriminant saliency (MDIS), a multi-scale
extension of DIS [21] under a dyadic scale framework, has a strong
theoretical foundation, as it is quantified by information theory and
adapted to multiple dyadic-scale structures. The performance of MDIS
against AIM is simulated on a standard database with well-established
numerical tools; the simulation data show that MDIS is competitive with
AIM in both accuracy and speed. However, MDIS fails to capture salient
regions in a few complex scenes; therefore, the next research step is to
improve MDIS accuracy in such cases. In addition, the implementation of
the MDIS algorithm on embedded systems is considered as a possible
research direction.